Once your data is read in and available as a DataFrame, Pandas provides a whole suite of tools for extracting information from it.
Let’s start by looking at some example data which contains information about the amounts that people at a restaurant paid and tipped for their meals:
import pandas as pdtips = pd.read_csv("./data/tips.csv")tips
total_bill
tip
day
time
size
0
16.99
0.71
Sun
Dinner
2
1
10.34
1.16
Sun
Dinner
3
2
21.01
2.45
Sun
Dinner
3
3
23.68
2.32
Sun
Dinner
2
4
24.59
2.53
Sun
Dinner
4
...
...
...
...
...
...
239
29.03
4.14
Sat
Dinner
3
240
27.18
1.40
Sat
Dinner
2
241
22.67
1.40
Sat
Dinner
2
242
17.82
1.22
Sat
Dinner
2
243
18.78
2.10
Thur
Dinner
2
244 rows × 5 columns
The first thing that you usually want to be able to do is to pull apart the overall table to get at specific bits of data from inside.
When using lists and dicts in Python, the square-bracket syntax was used to fetch an item from the container. In Pandas we can use the same syntax but it’s a much more powerful tool.
If you pass a single string to the square brackets of a DataFrame it will return to you just that one column:
Accessing a column like this returns an object called a Series which is the second of the two main Pandas data types. Don’t worry too much about these just yet but think of them as being a single column of the DataFrame, along with the index of the DataFrame.
If you pass a list of column names to the square brackets then you can grab out just those columns:
tips[["total_bill", "tip"]]
total_bill
tip
0
16.99
0.71
1
10.34
1.16
2
21.01
2.45
3
23.68
2.32
4
24.59
2.53
...
...
...
239
29.03
4.14
240
27.18
1.40
241
22.67
1.40
242
17.82
1.22
243
18.78
2.10
244 rows × 2 columns
Aside: Indexing
Note that the outer square brackets are saying “I’m selecting data” and the inner square brackets are saying “I’m giving you a list of column names”.
In this case it gives you back another DataFrame, just with only the required columns present.
Getting rows
If you want to select a row from a DataFrame then you can use the .loc (short for “location”) attribute which allows you to pass index values like:
tips.loc[2]
total_bill 21.01
tip 2.45
day Sun
time Dinner
size 3
Name: 2, dtype: object
If you want to grab a single value from the table, you can follow the row label with the column name that you want:
tips.loc[2, "total_bill"]
21.01
Exercise
The size column in the data is the number of people in the dining party. Extract this column from the DataFrame.
Answer
import pandas as pdtips = pd.read_csv("./data/tips.csv")
Now that we know how to refer to individual columns, we can start asking questions about the data therein. If you’ve worked with columns of data in Excel for example, you’ve probably come across the SUM() and AVERAGE() functions to summarise data. We can do the same thing in pandas by calling the sum() or mean() methods on a column:
tips["total_bill"].sum()
4827.77
tips["total_bill"].mean()
19.78594262295082
You can see a list of all the possible functions you can call in the documentation for Series. So for example, you can also ask for the maximum value from a column with the max() method.
tips["tip"].max()
7.0
In some situations, you don’t just want to get the value of the maximum, but rather to find out which row it came from. In cases like that there is the idxmax() method which give you the index label of the row with the maximum:
tips["total_bill"].idxmax()
170
So we know that the value of the maximum bill was £7 and it was found in the row with the label 170.
You can then use this information with the .loc attribute to get the rest of the information for that row:
Functions like sum() and max() summarise down the column to a single value. In some situations we instead want to manipulate a column to create a new column.
For example, the data in the table is in British pounds. If we wanted to convert it into the number of pennies then we need to multiply each value by 100. In Pandas you can refer to an entire column and perform mathematical operations on it and it will apply the operation to each row:
The data in row 0 was previously 16.99 but the result here is 1699.0, and likewise for every other row.
You can do any mathematical operation that Python supports, such as +, - and /.
Combining columns
Aside
Columns are actually combined by matching together their index labels, not strictly by their position in the column.
As well as operating on individual columns, you can combine together multiple columns. Any operation you do between two columns will be done row-wise, that is adding two columns will add together the two values from the first row of each, then the second row from each etc.
For example if we wanted to find out, for each entry in our table what the ratio between tip amount and total bill was, we could divide one column by the other:
It can get messy and hard-to-read doing too many things on one line, so it’s a good idea to split each part of your calculation onto its own line, giving each step its own variable name along the way.